Main Analysis
Data Cleaning
Because the data is very messy, we put considerable effort into cleaning it and extracting useful information for analysis.
- Convert to correct type
- Consolidate name, region, date
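As a rough illustration of the two steps above, the sketch below cleans a small made-up table in base R. The data frame `raw`, its column names, and the helper `clean_region` are hypothetical and are not taken from the NTTO files; the real cleaning involved more cases than shown here.

```r
# Hypothetical raw extract: everything arrives as text, with inconsistent labels.
raw <- data.frame(
  Region  = c(" Asia ", "ASIA", "Middle  East", "middle east"),
  Month   = c("2015-01", "2015-02", "2015-01", "2015-02"),
  Inbound = c("1,200", "1,350", "90", "n/a"),
  stringsAsFactors = FALSE
)

# Consolidate names: trim, lower-case, and collapse repeated spaces.
clean_region <- function(x) gsub("\\s+", " ", tolower(trimws(x)))

cleaned <- data.frame(
  Region  = clean_region(raw$Region),
  # Treat each month as its first day so it parses as a Date.
  Date    = as.Date(paste0(raw$Month, "-01")),
  # Drop thousands separators; non-numeric entries become NA.
  Inbound = suppressWarnings(as.numeric(gsub(",", "", raw$Inbound))),
  stringsAsFactors = FALSE
)
cleaned$Year <- as.integer(format(cleaned$Date, "%Y"))
```

Turning unparseable counts into `NA` rather than dropping the rows keeps the missing months visible in later summaries.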
Join same region
region_str <- "africa|asia|canada|latin america (excl mexico)|europe|mexico|middle east|oceania"
inbound_region <- tidy_ntto_inbound_m %>%
  filter(grepl(region_str, MixRegion)) %>%
  select(Region=MixRegion, Year, Date, Inbound) %>%
  group_by(Region, Year, Date) %>%
  summarise(TotalInbound=sum(Inbound)) %>%
  ungroup
outbound_region <- tidy_ntto_outbound_m %>%
  select(Region, Year, Date, Outbound) %>%
  group_by(Region, Year, Date) %>%
  summarise(TotalOutbound=sum(Outbound)) %>%
  ungroup
regional_travel <- inner_join(inbound_region, outbound_region,
                              by=c("Region"="Region", "Year"="Year", "Date"="Date"))
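An inner join silently drops any key that appears in only one of the two tables, so it can be worth checking what is lost. The sketch below shows one way to do that in base R with `merge`; the tables `inb` and `outb` are small made-up stand-ins for `inbound_region` and `outbound_region`, and the check assumes the totals themselves contain no genuine missing values.

```r
# Hypothetical stand-ins for inbound_region and outbound_region.
inb <- data.frame(Region = c("asia", "europe", "canada"),
                  Date = c("2015-01", "2015-01", "2015-01"),
                  TotalInbound = c(100, 200, 300),
                  stringsAsFactors = FALSE)
outb <- data.frame(Region = c("asia", "europe", "caribbean"),
                   Date = c("2015-01", "2015-01", "2015-01"),
                   TotalOutbound = c(50, 150, 400),
                   stringsAsFactors = FALSE)

keys <- c("Region", "Date")

# Counterpart of inner_join(): keep only keys present in both tables.
joined <- merge(inb, outb, by = keys)

# A full outer join exposes the keys the inner join would drop.
full <- merge(inb, outb, by = keys, all = TRUE)
dropped <- full[is.na(full$TotalInbound) | is.na(full$TotalOutbound), keys]
```

In this toy case `dropped` lists the regions that exist on only one side, which is usually a sign that a region name still needs consolidating.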
Challenges
We faced several challenges in this project:
- Because of problems in the data set, such as inconsistency, we had to spend a lot of time cleaning and reorganizing it, which was tedious and laborious.
- Some graphs need country-level data, but the dataset only provides region-level data. In those cases we had to project the regional data onto the whole region, which makes the analysis less comprehensive and detailed.
- Shiny is a great tool for creating interactive data visualizations in R, but we had little experience with it, and learning it in such a short time was not easy.
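The second challenge can be pictured with a small lookup table. The sketch below is one possible reading of "projecting onto the whole region", written in base R; the tables `region_totals` and `lookup` and their values are hypothetical.

```r
# Hypothetical region totals and a hand-made country-to-region lookup.
region_totals <- data.frame(Region = c("europe", "asia"),
                            TotalInbound = c(1000, 800),
                            stringsAsFactors = FALSE)
lookup <- data.frame(Country = c("France", "Germany", "Japan"),
                     Region = c("europe", "europe", "asia"),
                     stringsAsFactors = FALSE)

# Every country inherits its region's value, so countries within a
# region cannot be told apart in the resulting graph.
country_level <- merge(lookup, region_totals, by = "Region")
```

This is why the country-level graphs are less detailed: the value shown for a country is really the value of its region.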
Analysis
Travel and Tourism Analysis
regional_travel %>%
  select(Year, TotalInbound, TotalOutbound) %>%
  group_by(Year) %>%
  summarise(TotalInbound=sum(TotalInbound), TotalOutbound=sum(TotalOutbound)) %>%
  plot_ly(x = ~Year, y = ~TotalInbound, type = 'bar', name = 'Inbound', marker = list(color = 'rgb(55, 83, 109)')) %>%
  add_trace(y = ~TotalOutbound, name = 'Outbound', marker = list(color = 'rgb(26, 118, 255)')) %>%
  layout(title = 'Yearly Inbound and Outbound',
         xaxis = list(
           title = "",
           tickfont = list(
             size = 14,
             color = 'rgb(107, 107, 107)')),
         yaxis = list(
           title = 'Number of People',
           titlefont = list(
             size = 16,
             color = 'rgb(107, 107, 107)'),
           tickfont = list(
             size = 14,
             color = 'rgb(107, 107, 107)')),
         legend = list(orientation = 'h', x = 0, y = 1,
                       bgcolor = 'rgba(255, 255, 255, 0)', bordercolor = 'rgba(255, 255, 255, 0)'),
         barmode = 'group', bargap = 0.15)
The United States is one of the largest destinations for international visitors and also has a large number of outbound journeys. Both inbound and outbound numbers increased substantially from 2009 to 2010. Between 2010 and 2013, the number of international visitors rose slightly each year, while the number of outbound travellers stayed roughly stable. Since 2013, both have grown considerably, reaching 77.5 million international visits and 85.6 million outbound travellers in 2015.
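Statements such as "increased substantially" or "rose slightly" can be backed by a year-over-year growth rate. The following base R sketch shows the arithmetic on made-up numbers; the data frame `yearly` and its values are hypothetical and are not the totals plotted above.

```r
# Hypothetical yearly totals (millions); not the report's figures.
yearly <- data.frame(Year = 2009:2012,
                     TotalInbound = c(50, 60, 63, 66))

# Percent change relative to the previous year;
# the first year has no predecessor, so its growth is NA.
yearly$GrowthPct <- c(NA, 100 * diff(yearly$TotalInbound) /
                            head(yearly$TotalInbound, -1))
```

Applied to the real yearly totals, the same column would put a number on each of the periods described above.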
Naturally, we start wondering: what are the most popular destinations for Americans, and where do these international visitors come from? To answer these questions, we break the graph down by region.
p1 <- inbound_region %>%
spread(Region, TotalInbound) %>%
filter(Date>'2008-11') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~africa, name='africa', mode='lines', line = list(color="gray", width = 1)) %>%
add_trace(y=~asia, name='asia', mode='lines', line = list(color="red", width = 1)) %>%
add_trace(y=~canada, name='canada', mode='lines', line = list(color="orange", width = 1)) %>%
add_trace(y=~europe, name='europe', mode='lines', line = list(color="pink", width = 1)) %>%
add_trace(y=~`latin america excl mexico`, name='latin america excl mexico', mode='lines', line = list(color="green", width = 1)) %>%
add_trace(y=~mexico, name='mexico', mode='lines', line = list(color="purple", width = 1)) %>%
add_trace(y=~`middle east`, name='middle east', mode='lines', line = list(color="black", width = 1)) %>%
add_trace(y=~oceania, name='oceania', mode='lines', line = list(color="blue", width = 1))
p2 <- outbound_region %>% spread(Region, TotalOutbound) %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~africa, name='africa', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~asia, name='asia', mode='lines', line = list(color="red", width = 1), showlegend=F) %>%
add_trace(y=~canada, name='canada', mode='lines', line = list(color="orange", width = 1), showlegend=F) %>%
add_trace(y=~europe, name='europe', mode='lines', line = list(color="pink", width = 1), showlegend=F) %>%
add_trace(y=~`latin america excl mexico`, name='latin america excl mexico', mode='lines', line = list(color="green", width = 1), showlegend=F) %>%
add_trace(y=~mexico, name='mexico', mode='lines', line = list(color="purple", width = 1), showlegend=F) %>%
add_trace(y=~`middle east`, name='middle east', mode='lines', line = list(color="black", width = 1), showlegend=F) %>%
add_trace(y=~oceania, name='oceania', mode='lines', line = list(color="blue", width = 1), showlegend=F)
subplot(p1, p2, nrows=2, shareX=T) %>%
layout(title = "Inbound vs. Outbound",
yaxis = list(title = "Inbound"),
yaxis2 = list(title = "Outbound"),
legend = list(orientation = 'h')
)
After separating out each region, we observed several things:
- Most international travellers come from Canada, Mexico, Europe, and Asia.
- Mexico, Canada, Europe, and Latin America (excluding Mexico) are the top destinations for Americans.
- Every line shows seasonality, and the peak is usually reached in summer. For example, the number of Canadian visitors reaches its yearly peak every July.
- There is a boom in visitors from Latin America (excluding Mexico) at the beginning of 2014, and a boom in people travelling to Mexico at the start of 2010.
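The seasonality observation can also be checked numerically by averaging each calendar month across years. The sketch below does this in base R on a synthetic series built to peak in July; the data frame `series` and its values are hypothetical.

```r
# Hypothetical monthly series for one region: two years with a July peak.
dates <- seq(as.Date("2014-01-01"), by = "month", length.out = 24)
month_num <- as.integer(format(dates, "%m"))
visitors <- 100 + 50 * (month_num == 7) + 10 * (month_num %in% c(6, 8))
series <- data.frame(Date = dates, Month = month_num, Inbound = visitors)

# Average each calendar month across years to get a seasonal profile.
profile <- aggregate(Inbound ~ Month, data = series, FUN = mean)

# The calendar month with the highest average is the seasonal peak.
peak_month <- profile$Month[which.max(profile$Inbound)]
```

Running the same aggregation per region on the real monthly data would show whether every region peaks in summer or only some of them.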
We further plot inbound and outbound per region to explore each region's pattern individually.
p1 <- regional_travel %>%
filter(Region=='africa') %>%
plot_ly(x = ~as.POSIXct(Date), height = 1000) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1)) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1)) %>%
layout(autosize=F)
p2 <- regional_travel %>%
filter(Region=='asia') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p3 <- regional_travel %>%
filter(Region=='canada') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p4 <- regional_travel %>%
filter(Region=='europe') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p5 <- regional_travel %>%
filter(Region=='latin america excl mexico') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p6 <- regional_travel %>%
filter(Region=='mexico') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p7 <- regional_travel %>%
filter(Region=='middle east') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p8 <- regional_travel %>%
filter(Region=='oceania') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
subplot(p1, p2, p3, p4, p5, p6, p7, p8, nrows=8) %>%
layout(title = "Regional Inbound and Outbound",
yaxis = list(title = "Africa"),
yaxis2 = list(title = "Asia"),
yaxis3 = list(title = "Canada"),
yaxis4 = list(title = "Europe"),
yaxis5 = list(title = "Latin America"),
yaxis6 = list(title = "Mexico"),
yaxis7 = list(title = "Middle East"),
yaxis8 = list(title = "Oceania"),
legend = list(orientation = 'h', x = 0, y = 1.005)
)
Africa: More and more people from Africa have travelled to the U.S. since 2013, but the number of Americans who go to Africa has been very stable over the past 7 years.
Asia: More and more Asians come to the U.S.; however, fewer Americans have gone to Asia since May 2010. Another thing to notice is that the number of visitors from Asia grows faster in July than in other months. This is probably due to the increase in foreign students.
Canada:
Europe:
Latin America except Mexico:
Mexico: A boom in outbound travel happens at the end of 2009. We did a lot of research online and found news on the topic of More Mexicans Leaving Than Coming to the U.S. So we think the
Middle East: Both inbound and outbound numbers increase year over year.
Oceania: As with Asia, fewer U.S. citizens have visited the Oceania region since mid-2010.
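Words like "more and more" and "very stable" in the notes above could be made concrete with a simple trend slope per region. The sketch below fits a least-squares line in base R; the data frame `trend`, its values, and the helper `slope` are hypothetical, and a straight line is only a rough summary of a seasonal series.

```r
# Hypothetical yearly totals for one region: rising inbound, flat outbound.
trend <- data.frame(Year = 2009:2015,
                    Inbound = c(10, 12, 14, 16, 18, 20, 22),
                    Outbound = c(30, 31, 30, 29, 30, 31, 30))

# Slope of a least-squares line: average change per year.
slope <- function(y, x) unname(coef(lm(y ~ x))[2])

inbound_slope <- slope(trend$Inbound, trend$Year)
outbound_slope <- slope(trend$Outbound, trend$Year)
```

A clearly positive slope matches "more and more", while a slope near zero matches "stable".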
Then we move to outbound travel across all regions. From the graph below we observe several things:
1. Mexico is the number one outbound destination.
2. A huge jump happened in 2009 for Mexico.
3. Outbound travel to Mexico is increasing, while for other regions it seems stable.
4. Like inbound, outbound shows a seasonal pattern as well.
Since Canada and Mexico dominate the traveller counts, and since we are interested in how behaviour differs across regions, we next look at inbound and outbound per region.
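How strongly two regions dominate can be expressed as a share of the total. The following base R sketch shows the calculation on made-up totals; the vector `totals` and its values are hypothetical.

```r
# Hypothetical totals per region; not the report's figures.
totals <- c(canada = 400, mexico = 350, europe = 150, asia = 100)

# Share of all travellers by region, largest first.
share <- sort(totals / sum(totals), decreasing = TRUE)

# Combined share of the two neighbouring countries.
neighbour_share <- unname(share["canada"] + share["mexico"])
```

When two regions hold most of the total, a shared axis flattens every other line, which is the practical reason for plotting each region separately.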
Spend Analysis
yearly_spend <- tidy_ntto_spend_y %>%
filter(Region!='european union', Region!='south-central america', Region!='overseas') %>%
mutate(Region=recode(Region, "asia-pacific"="asia"), Spend=Spend*1000000) %>%
select(-Missing) %>%
arrange(Region, Year, Type, Category)
yearly_spend %>%
group_by(Type, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Type, TotalSpend) %>%
ungroup %>%
plot_ly(x = ~Year) %>%
add_trace(y=~`Payments (imports)`, type="scatter", name='Payments (imports)', mode = 'lines+markers', line = list(color="blue", width = 1)) %>%
add_trace(y=~`Receipts (exports)`, type="scatter", name='Receipts (exports)', mode = 'lines+markers', line = list(width = 1)) %>%
layout(title = "Yearly Spending",
xaxis = list(title = "Year"), yaxis = list(title = "Spend"),
legend = list(orientation = 'h', x = 0.5, y = 1.005))
yearly_spend %>%
group_by(Type, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Type, TotalSpend) %>%
plot_ly(x = ~Year, y = ~`Payments (imports)`, type = 'bar', name = 'Payments', marker = list(color = 'rgb(55, 83, 109)')) %>%
add_trace(y = ~`Receipts (exports)`, name = 'Receipts', marker = list(color = 'rgb(26, 118, 255)')) %>%
layout(title = 'Yearly Spending',
xaxis = list(
title = "",
tickfont = list(
size = 14,
color = 'rgb(107, 107, 107)')),
yaxis = list(
title = 'Spend (Billion $)',
titlefont = list(
size = 16,
color = 'rgb(107, 107, 107)'),
tickfont = list(
size = 14,
color = 'rgb(107, 107, 107)')),
legend = list(orientation = 'h', x = 0, y = 1,
bgcolor = 'rgba(255, 255, 255, 0)', bordercolor = 'rgba(255, 255, 255, 0)'),
barmode = 'group', bargap = 0.15, bargroupgap = 0.1)
yearly_spend %>%
filter(Type=="Payments (imports)") %>%
select(-Type) %>%
group_by(Region, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Region, TotalSpend) %>%
plot_ly(x = ~Year) %>%
add_trace(y=~africa, type="scatter", name='africa', mode = 'lines+markers', line = list(width = 2)) %>%
add_trace(y=~asia, type="scatter", name='asia', mode = 'lines+markers', line = list(width = 2)) %>%
add_trace(y=~europe, type="scatter", name='europe', mode = 'lines+markers', line = list(width = 2)) %>%
add_trace(y=~`latin america`, type="scatter", name='latin america', mode = 'lines+markers', line = list(width = 2)) %>%
add_trace(y=~`middle east`, type="scatter", name='middle east', mode = 'lines+markers', line = list(width = 2)) %>%
layout(title = "Payments (imports)",
xaxis = list(title = "Year"),
yaxis = list(title = "Spend"),
legend = list(orientation = 'h'))
yearly_spend %>%
filter(Type=="Receipts (exports)") %>%
select(-Type) %>%
group_by(Region, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Region, TotalSpend) %>%
plot_ly(x = ~Year) %>%
add_trace(y=~africa, type="scatter", name='africa', mode = 'lines+markers', line = list(width = 1)) %>%
add_trace(y=~asia, type="scatter", name='asia', mode = 'lines+markers', line = list(width = 1)) %>%
add_trace(y=~europe, type="scatter", name='europe', mode = 'lines+markers', line = list(width = 1)) %>%
add_trace(y=~`latin america`, type="scatter", name='latin america', mode = 'lines+markers', line = list(width = 1)) %>%
add_trace(y=~`middle east`, type="scatter", name='middle east', mode = 'lines+markers', line = list(width = 1)) %>%
layout(title = "Receipts (exports)",
xaxis = list(title = "Year"),
yaxis = list(title = "Spend"),
legend = list(orientation = 'h'))
yearly_spend %>%
  # Restrict to Africa so the data matches the plot title
  filter(Region=="africa") %>%
  group_by(Type, Year) %>%
  summarise(TotalSpend=sum(Spend)) %>%
  spread(Type, TotalSpend) %>%
  plot_ly(x = ~Year) %>%
  add_trace(y=~`Payments (imports)`, type="scatter", name='Payments (imports)', mode = 'lines+markers', line = list(color="blue", width = 1)) %>%
  add_trace(y=~`Receipts (exports)`, type="scatter", name='Receipts (exports)', mode = 'lines+markers', line = list(width = 1)) %>%
  layout(title = "Africa",
         xaxis = list(title = "Year"), yaxis = list(title = "Spend"),
         legend = list(orientation = 'h', x = 0.5, y = 1.005))
tidy_wb_gdp %>%
filter(CountryCode=="USA", Year>2000, Year<2016) %>%
select(Year, GDP) %>%
mutate(Year=factor(Year)) %>%
plot_ly(x = ~Year) %>%
add_trace(y=~GDP, type="scatter", name='US', mode = 'lines+markers', line = list(width = 1))
yearly_spend %>%
filter(Type=="Receipts (exports)") %>%
select(-Type, -Region) %>%
group_by(Category, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Category, TotalSpend) %>%
plot_ly(x = ~Year, y = ~Education, type = 'bar', name = 'Education') %>%
add_trace(y = ~`Medical/Short-Term Workers`, name = 'Medical/Short-Term Workers') %>%
add_trace(y = ~`Other Business/Other Personal Travel`, name = 'Other Business/Other Personal Travel') %>%
layout(yaxis = list(title = 'Spend'), barmode = 'group')
Finally, we select several regions and combine inbound and outbound travel with GDP.
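A minimal sketch of what such a combination could look like is given below, in base R. The tables `travel` and `gdp` are made-up stand-ins for the yearly travel totals and for `tidy_wb_gdp`; their values are illustrative only, and joining on `Year` alone is an assumption.

```r
# Hypothetical yearly travel totals and GDP; values are illustrative only.
travel <- data.frame(Year = 2010:2014,
                     TotalInbound = c(60, 62, 66, 70, 75),
                     TotalOutbound = c(60, 59, 61, 62, 68))
gdp <- data.frame(Year = 2009:2014,
                  GDP = c(14.4, 15.0, 15.5, 16.2, 16.8, 17.5))

# Keep only the years present in both tables.
combined <- merge(travel, gdp, by = "Year")

# Pearson correlation as a first, rough look at co-movement.
cor_inbound <- cor(combined$TotalInbound, combined$GDP)
```

Two series that both trend upward will correlate strongly regardless of any causal link, so a correlation like this is only a starting point for the comparison.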